HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Research on the problem of Hanoi
landmark recognition using images
NGUYEN THI SAO MAI
Major: Computer Science
Specialization: Data Science
Supervisor:
Dr. Dang Tuan Linh
School:
Information and Communication Technology
HA NOI, 2023
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
CONFIRMATION OF MASTER'S THESIS REVISIONS
Full name of the thesis author: Nguyen Thi Sao Mai
Thesis topic: Research on the problem of searching for Hanoi landmark information using images
Student ID: 20212607M
The author, the scientific supervisor, and the Thesis Examination Committee confirm that the author has revised and supplemented the thesis according to the minutes of the Committee meeting on 28/10/2023, with the following contents:
1. Reference images taken from other documents must be cited; use more appropriate image sizes and aspect ratios; avoid large blank spaces in the thesis (per the opinions of the Committee and Reviewer 1)
- The student has added full citations to the referenced images in the thesis. In addition, the small blank spaces in the thesis have been removed.
2. The performance evaluation section needs to describe the evaluation on the validation/test dataset and the number of samples (per the opinions of the Committee and Reviewer 1)
- The student has added a description of the performance evaluation on the validation dataset, with 617 samples, in Chapter V, pages 36-37 of the thesis.
3. Add a description of the solution for turning the landmark image recognition problem into a Hanoi landmark information search application, to match the objective stated in the thesis title (per the opinions of the Committee and Reviewer 1)
- The student has added a description of the solution for turning the landmark image recognition problem into a Hanoi landmark information search application, matching the objective in the thesis title. This solution can be found in Section 2 of Chapter VI, pages 40-41 of the thesis, and is also summarized below:
From the Hanoi landmark recognition model that has been researched, the student proposes, as future work, a solution for turning it into a Hanoi landmark recognition application consisting of a mobile app and a storage server, as follows:
- User: In the user interface, the user can upload an image or take a photo on their phone
- App: Sends the image to the server
- Server: Receives the image and performs recognition
- Server: Sends the result back to the app
- App: Receives the result and displays it in the interface
- User: Presses a button to view information about the landmark
4. Revise the layout of the thesis (per the opinions of the Committee and Reviewer 2)
- The student has added three chapters: Proposed Method, Dataset, Experiments
5. Chapter 3: The choice of parameters needs to be analyzed (per the opinions of the Committee and Reviewer 2)
- The student has added an analysis of the parameter choices in Chapter V, Section 5.2, pages 34-35 of the thesis. Specifically:
- Dense Layer 1 uses 128 units with the ReLU activation function, and Dense Layer 2 uses 15 units with the softmax activation function. The value of 128 units for Dense Layer 1 was chosen based on experiments and model tuning, so as to balance model complexity against trainability without running into overfitting, and based on the actual number of samples the student collected for the dataset. The final Dense layer is designed with 15 units because the desired number of output classes is 15 (corresponding to the goal of recognizing 15 classes of Hanoi landmark images). The ReLU activation function is used to reduce the amount of computation and to increase the ability to learn non-linear features in the neural network. The softmax activation function is used in the final layer to convert the outputs into probabilities, with each output value corresponding to the probability of the respective class.
6. The experimental results need to be explained in more detail (per the opinions of the Committee and Reviewer 2)
- The student has added a more detailed explanation of the experimental results in Chapter V, in the second paragraph of page 36 and the first paragraph of page 37.
7. Present the reasoning about the relationship between the two problems being solved. Present the rationale for choosing YOLOv4-Tiny, YOLOv7-Tiny, and YOLOv5. Add comparison results with other algorithms.
- In the thesis, the student addresses only one research problem: Hanoi landmark recognition, for which a CNN-based method is used. YOLOv4-Tiny, YOLOv7-Tiny, and YOLOv5 are not used.
November 20, 2023
Supervisor Thesis author
CHAIRMAN OF THE COMMITTEE
ACKNOWLEDGMENTS
First and foremost, I would like to express my deepest gratitude to my advisor,
Dr. Dang Tuan Linh. He has been incredibly supportive to me from the moment I
started this project until I wrote the project report today.
Balancing work and additional studies while raising a young child, I had many nights
when I contemplated giving up. Dr. Linh consistently checked in on my research
progress and encouraged me to persevere and complete the project.
I know that I am not as proficient as my peers, but at least on this challenging
journey, I did not give up. Additionally, I want to extend my gratitude to Cao Quoc
Khanh and Tran Chi Cuong, who have been my companions and supporters not
only in our work but also in my life.
Lastly, I want to express my thanks to my family, my mother, my husband, and
especially my little daughter. I couldn't spend much time with you all during this
period. I am sorry and thank you very much. Your smiles are a powerful motivation
for me. Whenever I wanted to give up, I thought, how could I teach my child if I
quit? I chose not to give up so that I could teach her to always strive to complete
her tasks.
I wish Dr. Dang Tuan Linh, Khanh, Cuong, my family, and my little daughter good
health and all the best for the people I love.
ABSTRACT
This thesis presents research on “Hanoi Landmark Recognition”. Based on the
statistics provided by the General Statistics Office, in the first nine months of 2023,
domestic tourist arrivals are estimated to reach 15.7 million, an increase of 20.2%
compared to the same period in 2022. The number of international tourists visiting
Vietnam is steadily increasing. There is a high demand for recognizing landmarks.
To my knowledge, there is only one application for aiding tourists in recognizing
attractions in Hanoi, which is Google Lens. There are also some applications that
help tourists find their way and provide information about Hanoi. Of all the
applications investigated in this thesis, MeTrip - Trip Planner and Sygic Travel
Maps are two outstanding applications that serve these purposes.
The approach I have chosen is to create a dedicated dataset for Hanoi landmarks,
captured under various lighting conditions, including early morning, midday, and
evening. This dataset allows us to utilize computer vision techniques, particularly
Convolutional Neural Networks (CNNs), for feature extraction. I have selected this
approach due to its effectiveness in image recognition tasks.
The main contribution of my thesis is to build a dataset for landmarks in Hanoi.
The second is to find a suitable algorithm and benchmark for the created dataset
and the landmark recognition problem, with an accuracy of at least 80%.
The achieved results include the creation of a dataset comprising images in various
lighting conditions across 15 classes, encompassing 3,108 samples, and a model
accuracy greater than 95%.
Student
(Signature and full name)
TABLE OF CONTENTS
CHAPTER 1. OVERVIEW ..................................................................... 1
1.1 Motivation ........................................................................................... 1
1.2 Objectives and tentative solution ............................................................ 2
1.2.1 Objectives ................................................................................. 2
1.2.2 Tentative solution....................................................................... 2
1.3 Contribution......................................................................................... 2
1.3.1 Self-Captured Image Dataset....................................................... 2
1.3.2 Landmark recognition models ..................................................... 3
1.4 Thesis organization............................................................................... 3
CHAPTER 2. LITERATURE REVIEW .................................................. 5
2.1 Related work ........................................................................................ 5
2.1.1 Landmark Recognition problems................................................. 5
2.1.2 Landmark recognition dataset ..................................................... 6
2.1.3 Technical................................................................................... 7
2.2 Background.......................................................................................... 8
2.2.1 VGG16 ..................................................................................... 8
2.2.2 DenseNet .................................................................................. 9
2.2.3 ResNet ...................................................................................... 9
2.2.4 MobileNet................................................................................. 11
2.2.5 Loss Function ............................................................................ 17
2.2.6 Supported Library...................................................................... 19
CHAPTER 3. PROPOSED METHOD..................................................... 22
3.1 Overview ............................................................................. 22
3.2 Data Preprocessing ............................................................................... 22
3.3 Feature Extractions ............................................................................... 23
3.4 Landmark Recognition.......................................................................... 25
CHAPTER 4. DATASET ......................................................................... 28
4.1 Overview ............................................................................. 28
4.2 Data Collection .................................................................................... 28
4.3 Data Statistics....................................................................................... 29
CHAPTER 5. EXPERIMENTS ............................................................... 34
5.1 Environment ........................................................................................ 34
5.2 Experiment .......................................................................................... 34
CHAPTER 6. CONCLUSION AND FUTURE WORK ............................ 40
6.1 Conclusion........................................................................................... 40
6.2 Future work.......................................................................................... 40
REFERENCE .......................................................................................... 44
APPENDIX .............................................................................................. 46
A. APPENDIX......................................................................................... 46
LIST OF FIGURES
Figure 2.1 Architecture of VGG16 [3] . . . . . . . . . . . . . . . . . 8
Figure 2.2 A 5-layer dense block with a growth rate of k = 4. Each
layer takes all preceding feature-maps as input [5] . . . . . . . . . . 10
Figure 2.3 A deep DenseNet with three dense blocks. The layers be-
tween two adjacent blocks are referred to as transition layers and
change feature-map sizes via convolution and pooling [5] . . . . . . 10
Figure 2.4 Residual block [20] . . . . . . . . . . . . . . . . . . . . . . . 11
Figure 2.5 Every residual block has two 3x3 conv layers [20] . . . . . . 12
Figure 2.6 Periodically double the number of filters and downsample F(x)
spatially using stride 2 (/2 in each dimension) [20] . . . . . . . . . 13
Figure 2.7 No FC layers at the end (only FC 1000 to output classes) [20] 14
Figure 2.8 Depthwise separable convolutions visualization [21] . . . . . 15
Figure 2.9 Pointwise convolutions visualization [21] . . . . . . . . . . . 15
Figure 2.10 Bottleneck residual block of MobileNet v2 [18] . . . . . . . 16
Figure 2.11 Building on MobileNetV3, the proposed segmentation head,
Lite R-ASPP, delivers fast semantic segmentation results while mix-
ing features from multiple resolutions [23] . . . . . . . . . . . . . . 17
Figure 3.1 System overview . . . . . . . . . . . . . . . . . . . . . . . . . 22
Figure 3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . 23
Figure 3.3 Feature Extractions . . . . . . . . . . . . . . . . . . . . . . . 24
Figure 3.4 Feature Extractions Details . . . . . . . . . . . . . . . . . . . 24
Figure 3.5 Landmark Recognition . . . . . . . . . . . . . . . . . . . . . 26
Figure 3.6 Landmark Recognition Details . . . . . . . . . . . . . . . . . 26
Figure 4.1 Data Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Figure 4.2 One Pillar Pagoda . . . . . . . . . . . . . . . . . . . . . . . . 30
Figure 4.3 Hanoi Opera House . . . . . . . . . . . . . . . . . . . . . . . 30
Figure 4.4 The Huc Bridge . . . . . . . . . . . . . . . . . . . . . . . 31
Figure 4.5 Images of 15 attractions from the dataset . . . . . . . . . . . 32
Figure 4.6 Distribution of image classes . . . . . . . . . . . . . . . . . . 33
Figure 5.1 Comparison between top-1 accuracy and model size . . . . . 36
Figure 5.2 Comparison between execution time and model size . . . . . 37
Figure 5.3 Accuracy and Loss function of ResNet-50 [17] [4] . . . . . . 38
Figure 5.4 Accuracy and Loss function of VGG16 [3] . . . . . . . . . . 38
Figure 5.5 Accuracy and Loss function of DenseNet [5] . . . . . . . . . 39
Figure 5.6 Accuracy and Loss function of MobileNet_v2 [18] . . . . . . 39
Figure 6.1 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . 41
LIST OF TABLES
Table 2.1 Comparison of Google Lens, MeTrip, and Sygic Travel Maps 5
Table 2.2 MobileNet v2 architecture and configuration [22] . . . . . . . 16
Table 2.3 Supported Library . . . . . . . . . . . . . . . . . . . . . . . . 19
Table 4.1 Overview of the 15-class dataset . . . . . . . . . . . . . . . 29
Table 4.2 Statistical information about the dataset . . . . . . . . . . . . 29
Table 4.3 Number of samples per each class . . . . . . . . . . . . . . . . 31
Table 5.1 Hyperparameter . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Table 5.2 The best accuracy and execution times of the models . . . . 35
Table 5.3 The weight of the models (MB) . . . . . . . . . . . . . . . . . 36
LIST OF ABBREVIATIONS
Abbreviation Full Expression
Adam Adaptive Moment Estimation
CNNs Convolutional Neural Networks
DenseNet Densely Connected Convolutional
Network
GPU Graphics Processing Unit
ReLU Rectified Linear Unit
ResNet18 Residual Network 18
ResNet50 Residual Network 50
Softmax Soft Maximum
VGG16 Visual Geometry Group 16-layer
CHAPTER 1. OVERVIEW
1.1 Motivation
There are three main reasons why I chose the topic “Research on the problem
of Hanoi landmark recognition using images”: tourism and travel, technological
advancements, and personal interests.
First and foremost, Hanoi is a popular tourist destination, attracting visitors from
around the world. In the first nine months of 2023 [1], domestic tourist arrivals are
estimated to reach 15.7 million [2], an increase of 20.2% compared to the same
period in 2022. The average occupancy rate of hotel establishments in Hanoi is
projected to be 61% for the same 9-month period in 2023, a 27% increase com-
pared to the corresponding period. This indicates that the number of staying guests
in Hanoi has increased. An application that can recognize landmarks can enhance
the travel experience by providing information and context about these sites.
Secondly, advances in computer vision and machine learning have made it possi-
ble to develop robust landmark recognition systems. The development of computer
vision in the context of landmark recognition has seen remarkable progress in re-
cent years. This field focuses on teaching machines to understand and recognize
landmarks and significant locations in images and videos. Several key factors have
contributed to this advancement: Deep Learning, Large Datasets, Improved
Hardware, and Transfer Learning. Overall, the continuous advancements in computer
vision, coupled with the increasing availability of data and improvements in hard-
ware, have led to substantial progress in landmark recognition. This technology
has broad applications in fields like tourism, navigation, cultural preservation, and
urban planning, making it an exciting and evolving area of research and develop-
ment. Exploring this technology in the context of Hanoi can showcase the potential
of these advancements.
Last but not least, on a personal note, Hanoi is a place where I have lived for
nearly 15 years, with numerous memories, and I want to do something meaningful
for this place. I have also found my passion in computer vision.
The development of the transportation industry has made moving from one place
to another easier for mankind, leading to an increasing demand for travel. Since
then, many applications that support travelers on their trips have been born. They
have helped travelers a lot in finding accommodations and restaurants and planning
their trips. However, some applications focus only on finding hotels and restaurants
without helping travelers find places of interest, some may help find attractions, but
do not provide necessary information or do provide some fee. Moreover, some of
1
CHAPTER 1. OVERVIEW
the applications do not have a feature of recognizing landmarks and sometimes
causes mistakes for travelers. Therefore, developing research directions for land-
mark recognition in Hanoi is essential.
1.2 Objectives and tentative solution
1.2.1 Objectives
This thesis will develop a service for landmark recognition in Hanoi that tourists
in Hanoi can use as a simple tour guide, with the following objectives:
- Build a dataset for landmarks in Hanoi. This dataset contains images in different
lighting conditions in at least 15 classes, with a minimum of 2,500 samples.
- Find a suitable algorithm and solution for the created dataset and the landmark
recognition problem, with an accuracy of at least 80%.
1.2.2 Tentative solution
From the motivations outlined in Section 1.1, I propose the following tentative
solutions for the Hanoi landmark recognition problem:
- Data Collection: These images can be collected from two sources. The first is
public databases and social media on the Internet; the second is photographs
taken by myself. This dataset contains images in different lighting conditions
and from different angles to make the final result as good as possible. 15
landmarks in Hanoi were chosen: Hang Dau water tower, Flag Tower of Hanoi,
Temple of Literature, Quan Thanh temple, etc. These landmarks are famous for
the historical stories behind them; they have great spiritual significance and
represent Hanoi in different historical periods.
- Landmark Recognition: Extract relevant features from the images. Convolutional
Neural Networks (CNNs) such as VGG, ResNet, DenseNet, and MobileNet are used
to capture the distinctive characteristics of each landmark. Images are
preprocessed before being fed into the networks for training. Both the accuracy
and the size of the model are considered for this task.
1.3 Contribution
1.3.1 Self-Captured Image Dataset
As mentioned above, the thesis aims to classify photos of tourist attractions in
Hanoi, necessitating the training of an image classification model using relevant
images. Currently, there is no available dataset for landmarks and attractions in
Hanoi, which has led to the creation of my own dataset. The images in this dataset
are taken by myself and from the internet. The dataset contains approximately 3,108
images of 15 landmarks in Hanoi: Hang Dau Water Tower, The Huc Bridge, One
Pillar Pagoda, Tran Quoc Pagoda, Lenin Park, Flag Tower of Hanoi, Quan Thanh
Temple, Ho Chi Minh Mausoleum, Hanoi Opera House, St. Joseph's Cathedral (The
Big Church), Hoa Lo Prison, Turtle Tower, Ly Thai To Statue, Temple of
Literature, and Hang Dau Flower Garden.
The images are in different lighting conditions and angles to enrich the dataset
as shown in Chapter 4. After capturing the images, I performed necessary post-
processing tasks, such as image enhancement, noise reduction, and format stan-
dardization, to ensure consistency and quality in the dataset.
Building my Hanoi landmark dataset required effort and resources. Sharing my
dataset with the research community can contribute to advancements in the fields
of computer vision and landmark recognition. Other researchers and developers
may find my dataset valuable for their projects.
In conclusion, the self-captured image dataset stands as a pivotal contribution to
this thesis. It reflects the dedication, precision, and commitment to producing high-
quality research results while ensuring that the data collected aligns closely with
the research objectives. This contribution enhances the overall quality and rele-
vance of the research findings.
1.3.2 Landmark recognition models
After collecting data, this thesis researches different approaches to find a suit-
able solution for the problem of attraction recognition. Various deep learning
models were investigated, such as VGG16 [3], ResNet-50 [4], DenseNet [5], and
MobileNet_v2 [6]. These models were initially trained on the ImageNet dataset with
1000 classes, so for the purpose of classifying 15 landmarks, this thesis modifies
the models so that they fit this problem.
The models that obtained the highest accuracy were DenseNet and MobileNet_v2.
DenseNet achieved the highest accuracy of 99.6%, followed by MobileNet_v2
(98.84%), VGG16 (97.51%), and ResNet-50 (85.59%). DenseNet and MobileNet_v2
outperform all the other models in terms of accuracy. ResNet-50 has the largest model
size, consuming 89.98 Megabytes. DenseNet follows with a slightly smaller model
size of 69.89 Megabytes. VGG16 has a comparatively smaller model size, us-
ing 56.13 Megabytes. On the other hand, MobileNet_v2 stands out as the most
lightweight model, with a significantly smaller size of only 8.61 Megabytes.
1.4 Thesis organization
The remaining thesis is organized as follows:
Chapter 2 provides an overview of related work and background
Chapter 3 mentions the technology that the thesis takes advantage of, why it was
chosen and where it is used in the program, and also how to set up and configure
those technologies
Chapter 4 provides an overview of the dataset, describes how the dataset was
collected, and presents statistical figures related to the dataset
Chapter 5 presents all the work and significant contributions, as well as the
problems encountered during the development of this product
Chapter 6 summarizes the thesis, what the product has achieved, the limitations,
and the future development
CHAPTER 2. LITERATURE REVIEW
2.1 Related work
2.1.1 Landmark Recognition problems
There are various attractions in Hanoi and they create historical and cultural
highlights for this city. The capital of Vietnam welcomes a large number of
foreign tourists each year. Before the COVID-19 pandemic, it was estimated that
over seven million foreigners came to Hanoi in 2019. However, tourists may have
difficulties recognizing those attractions; some of them do not have a signboard,
or their signboard is only in Vietnamese.
Currently, there is only one application for aiding tourists in recognizing
attractions in Hanoi, which is Google Lens. There are also some applications that
help tourists find their way and provide information about Hanoi. Of all the
applications investigated in this thesis, MeTrip - Trip Planner and Sygic Travel
Maps are the two outstanding applications that serve these purposes. Table 2.1
shows the comparison between the three applications.
Table 2.1: Comparison of Google Lens, MeTrip, and Sygic Travel Maps
Feature | Google Lens | MeTrip | Sygic Travel
Landmark recognition | Yes | No | No
Provide attractions information | Linked to Google search | No | Paid feature
Support map and navigation | Linked to Google Maps | Linked to Google Maps | Yes
Allow photo upload for recognition | No | No | No
Attractions suggestion | No | Yes | Yes
Google Photos is a photo sharing and storage service developed by Google. It
was released in 2015 and is now hugely popular worldwide. It has many features,
and Google Lens is one of the major ones. Google Lens is image recognition
software designed to bring up relevant information using visual analysis. This
feature is available only in the Google Photos mobile application. When viewing a
photo in the Google Photos app on a mobile device, users can invoke the Google
Lens service, and Google will then serve up suggestions. Google Lens has a
user-friendly interface, recognizes attractions in images quickly, and has good
accuracy. However,
users must take a picture on their phone, then go to the Google Photos app to use it.
If users want to see information about the attraction, they have to click the button
to go to Google Maps or Google search. This requires other applications and extra
steps and is quite inconvenient sometimes.
The MeTrip app was built in mid-2016 in Ho Chi Minh City, Vietnam. MeTrip is a
travel community that allows people to schedule travel, consult other members'
schedules, and search for tourist attractions [7]. This app suggests attractions
to tourists based on the city and country they select; on the other hand, it does
not provide any information about the attractions. The navigation feature also
requires Google Maps.
Sygic Travel Maps is also an application for tourists; it displays attractions,
hotels, restaurants, and shops directly on the map [8]. This app is able to locate
and navigate users to a desired destination and suggest tourist landmarks nearby.
However, to read the information about the attractions, users have to pay for it
as a premium feature.
From the above comparison, this thesis proposes a method to identify attractions,
provide information about landmarks for free, and help users to navigate in Hanoi.
2.1.2 Landmark recognition dataset
Creating a dataset for landmark recognition involves reviewing related work in
the field to understand existing datasets and their contributions. Here are some
related works for landmark recognition datasets:
Google Landmarks Dataset [9]: Google's Landmarks dataset is one of the
most comprehensive and widely used datasets for landmark recognition. It
contains a vast collection of images of landmarks from around the world, making
it a valuable resource for training and evaluating recognition models. Its
advantage is that it is extensive and contains images from around the world;
there may be some images related to Hanoi if users or researchers have
contributed them. On the other hand, its disadvantage is that it does not focus
on a specific location such as Hanoi, and the quality and resolution of its
images may not be consistent.
Places365 [10]: While not specifically for landmarks, the Places365 dataset
is rich in outdoor scenes and locations. Researchers have used this dataset to
train models for landmark recognition, especially for larger-scale scenes.
InLoc [11]: InLoc is a dataset specifically designed for indoor localization
and recognition. While it focuses on indoor landmarks, it can serve as a
resource for studies involving indoor landmarks, which is a valuable addition to
outdoor-focused datasets.
Geolife GPS Trajectories [12]: Geolife is a dataset that contains GPS
trajectories of individuals, providing valuable location data. Researchers have
used this dataset to extract landmarks and location information for recognition
tasks.
The common drawback of the aforementioned datasets, including Google Landmarks,
Places365, and InLoc, is their lack of focus on Hanoi and its specific landmarks.
Furthermore, the Google Landmarks Dataset may suffer from inconsistent image
quality and resolution. On the other hand, the Geolife GPS Trajectories dataset
primarily contains geographical information and GPS coordinates, without
providing direct images of landmarks. Therefore, there is a need to construct a
dedicated dataset tailored to Hanoi.
2.1.3 Technical
Landmark recognition is a field of computer vision and image processing that
focuses on the identification and classification of well-known landmarks or notable
places in images or videos. This technology has numerous practical applications,
such as in tourism, augmented reality, geographic information systems, and cultural
heritage preservation. Researchers and developers have made significant progress
in landmark recognition, and here are some notable related works in this area:
Image Retrieval and Matching [13] [14]: Landmark recognition often involves
image retrieval and matching techniques. Systems use methods like SIFT (Scale-
Invariant Feature Transform), SURF (Speeded-Up Robust Features), and ORB
(Oriented FAST and Rotated BRIEF) for keypoint detection and feature match-
ing.
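The matching step these systems rely on can be sketched in a few lines. The following is an illustrative NumPy example of brute-force Hamming matching, the strategy typically paired with binary descriptors like ORB's; the descriptors here are random placeholders standing in for the output of a real keypoint detector.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy binary descriptors, shaped like ORB output (one 256-bit string per keypoint).
# A real system would compute these from detected keypoints; here they are random.
desc_a = rng.integers(0, 2, size=(5, 256), dtype=np.uint8)  # image A: 5 keypoints
desc_b = desc_a.copy()
desc_b[0, :10] ^= 1  # flip a few bits to simulate a small viewpoint change

def match_brute_force(da, db):
    """For each descriptor in da, return the index in db with the smallest
    Hamming distance (number of differing bits)."""
    # Pairwise Hamming distances via XOR followed by a bit count.
    dists = (da[:, None, :] ^ db[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)

matches = match_brute_force(desc_a, desc_b)
print(matches)  # each keypoint in A is matched back to its counterpart in B
```

In a full pipeline, such matches between a query photo and reference images of known landmarks are then filtered (e.g. by a ratio test) before deciding which landmark the photo shows.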
Geospatial Data Integration [15]: Some landmark recognition systems inte-
grate geospatial data such as GPS coordinates and 3D models to improve
accuracy and provide additional context. This is particularly useful for aug-
mented reality applications.
Fine-Grained Landmark Recognition [16]: Researchers are developing fine-
grained landmark recognition models that can distinguish between different
parts or details of a landmark. This level of recognition is crucial for certain
applications like art analysis.
Deep Learning Approaches [4] [5] [17] [18]: Deep
learning, particularly convolutional neural networks (CNNs), has revolution-
ized landmark recognition. Many state-of-the-art models leverage CNNs to
extract features from images and achieve high accuracy in recognizing land-
marks.
2.2 Background
In this thesis, Convolutional Neural Networks (CNNs) are used for feature
extraction. The feature extraction network uses various pretrained architectures
such as ResNet-50, MobileNet_v2, VGG16, and DenseNet.
2.2.1 VGG16
VGG is the model that standardized architecture design for deep learning
networks. It was introduced in the paper Very Deep Convolutional Networks for
Large-Scale Image Recognition [3] in 2014. It improves over AlexNet [19] by
replacing the large 11x11 and 5x5 kernel-sized filters in the first and second
convolutional layers, respectively, with multiple 3x3 kernel-sized filters one
after another.
The VGG family of networks uses only 3x3 convolutional layers. The architecture
of VGG16 is shown in Figure 2.1.
Figure 2.1: Architecture of VGG16 [3]
The reason this thesis investigates VGG is that VGG was the runner-up in the
ImageNet Large Scale Visual Recognition Challenge in 2014 [3], a contest where
software programs compete to correctly classify and detect objects and scenes.
The weight configuration of VGGNet is publicly available and has been used in
many other applications and challenges as a baseline feature extractor. In
general, the pattern of the VGG models has become the starting point for simple
convolutional neural networks.
For the purpose of classifying 15 landmarks, the feature extraction part of the
model is used, and add a new classifier that is fit to this landmarks dataset. Specif-
ically, the weights of all of the convolutional layers is frozen during training, and
new fully connected layers that will learn to interpret the features results from the
model are trained. The modification of VGG is as follows:
Specify that the input of the model is 224x224.
Remove the classifier part of the model and add a Flatten layer in its place.
Add two fully connected layers: the first has 128 nodes, and the second has 15 nodes to predict the probability of an image belonging to each of the 15 classes.
The hidden layers use the ReLU activation function with standard weight initialization; the output layer uses softmax.
The loss function is categorical cross-entropy, which is suitable for multiclass classification. This thesis uses a conservative configuration of the Adam optimizer, with the learning rate set to 0.0001.
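The modification steps above can be sketched in Keras, the framework used in this thesis. The frozen base, 128-unit hidden layer, 15-way softmax output, and 0.0001 learning rate follow the text; everything else is a plausible default, not the author's exact code:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models, optimizers

# weights="imagenet" would load the pretrained weights used in the thesis;
# None keeps this sketch runnable offline.
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze every convolutional layer

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(128, activation="relu"),    # feature-interpretation layer
    layers.Dense(15, activation="softmax"),  # one unit per landmark class
])
model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```

With `include_top=False`, only the convolutional feature extractor of VGG16 is reused, and the new head is the only part updated during training.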
2.2.2 DenseNet
DenseNets require fewer parameters than an equivalent traditional CNN, as there is no need to learn redundant feature maps. Figure 2.2 illustrates the DenseNet layout schematically. DenseNet never combines features through summation before they are passed into a layer; instead, it combines features by concatenating them.
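The contrast between summation (as in ResNet) and concatenation (as in DenseNet) can be illustrated with NumPy; this is a sketch of the array shapes involved, not an actual layer implementation:

```python
import numpy as np

# Two feature maps of shape (height, width, channels)
f1 = np.ones((8, 8, 4))
f2 = np.ones((8, 8, 4))

# ResNet-style combination: element-wise summation keeps the channel count
summed = f1 + f2
print(summed.shape)        # (8, 8, 4)

# DenseNet-style combination: concatenation along the channel axis,
# so every later layer sees all preceding feature maps
concatenated = np.concatenate([f1, f2], axis=-1)
print(concatenated.shape)  # (8, 8, 8)
```

The growing channel dimension is why DenseNet layers can stay narrow (small growth rate k) while still receiving rich inputs.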
2.2.3 ResNet
The full ResNet architecture is described in Figures 2.4 to 2.7. The architecture includes:
Stacked residual blocks
Every residual block has two 3x3 conv layers
Periodically, the number of filters is doubled and the feature maps are downsampled spatially using stride 2 (/2 in each dimension)
An additional conv layer at the beginning (stem)
No FC layers at the end (only FC 1000 to output classes)
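The skip connection at the heart of a residual block can be written out in NumPy; this is a toy sketch in which dense weight matrices stand in for the two 3x3 convolutions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # F(x): two weight layers with a nonlinearity in between
    fx = relu(x @ w1) @ w2
    # The skip connection adds the input back before the final activation:
    # y = relu(F(x) + x), so gradients can flow through the identity path
    return relu(fx + x)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w1 = rng.standard_normal((16, 16))
w2 = rng.standard_normal((16, 16))
print(residual_block(x, w1, w2).shape)  # (16,)
```

Because the block only needs to learn the residual F(x) = y - x rather than the full mapping, very deep stacks of such blocks remain trainable.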
ResNet-18 and ResNet-50 are both variants of the Residual Network (ResNet)
architecture. They share common features of ResNet, as follows:
Residual Blocks: Both ResNet-18 and ResNet-50 use residual blocks as their building blocks. These blocks contain convolutional layers, batch normalization, and skip connections
Figure 2.2: A 5-layer dense block with a growth rate of k = 4. Each layer takes all preceding feature-maps as input [5]
Figure 2.3: A deep DenseNet with three dense blocks. The layers between two adjacent blocks are referred to as transition layers and change feature-map sizes via convolution and pooling [5]
Skip Connections: Both architectures utilize skip connections (shortcut con-
nections) that allow information to bypass one or more layers. These connec-
tions are crucial for addressing the vanishing gradient problem
Building Depth: Both networks aim to achieve greater depth, enabling them
to capture more complex features and patterns in images
Besides these commonalities, ResNet-18 and ResNet-50 differ in architecture:
Number of Layers: The most significant difference is the number of layers in
each network. ResNet-18 has 18 layers, while ResNet-50 has 50 layers. This
difference in depth allows ResNet-50 to capture more intricate features but also makes it computationally more intensive
Figure 2.4: Residual block [20]
Model Size: Due to the increased number of layers in ResNet-50, it is a larger
and more complex model compared to ResNet-18. As a result, ResNet-50 typ-
ically requires more memory and computational resources for training and
inference
Performance: ResNet-50 tends to outperform ResNet-18 on various image
classification tasks due to its increased depth. It can capture more fine-grained
details, resulting in higher accuracy.
Applications: ResNet-18 may be more suitable for simpler tasks or situations
with limited computational resources. ResNet-50 is preferred for tasks that
demand high accuracy and can benefit from a deeper network
In summary, the primary difference between ResNet-18 and ResNet-50 is the depth
and complexity of the models. ResNet-50 is deeper and generally achieves higher
accuracy but requires more computational resources.
2.2.4 MobileNet
In April 2017, Google published a research work called MobileNets [6]. After about a year of development, Google published a new version in 2018, called MobileNet_v2 [18]. MobileNet_v2 is a significant improvement over MobileNet_v1 and pushes the state of the art for mobile visual recognition, including classification, object detection, and semantic segmentation.
Figure 2.5: Every residual block has two 3x3 conv layers [20]
a, MobileNets
The main idea behind the first version is depthwise separable convolutions [21]. A normal convolutional layer applies a filter to every channel of the image. It slides this filter over the image and, at each step, performs a weighted sum of the pixels covered by the filter across all channels. The convolution operation combines all of the input channel values. If the image has 3 input channels, then the result of running a convolution filter across this image is an image with only one channel per pixel.
Therefore, the new output pixel is only written on a single channel, as shown in Figure 2.8. MobileNets also uses this kind of convolution, but just in the first layer.
Figure 2.6: Periodically double the number of filters and downsample F(x) spatially using stride 2 (/2 in each dimension) [20]
The rest of the layers use “depthwise separable convolution”. This operation is composed of two sub-operations, called depthwise and pointwise. The depthwise
convolution is performed separately on each channel. For an image with 3 chan-
nels, a depthwise convolution creates an output image that also has 3 channels. It
helps to filter the input channels. Following depthwise convolution is pointwise
convolution. This is the same as a normal convolution but with a 1×1 filter, the
visualization is in Figure 2.9. This pointwise convolution will combine the output
channels of the depthwise convolution to create new features. The results of a regular convolution layer and a depthwise separable convolution layer are quite similar: both create new features. However, the regular convolution layer requires far more computation and learns many more weights.
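The saving can be made concrete by counting weights. The following back-of-the-envelope sketch uses illustrative channel sizes (32 in, 64 out) and ignores biases:

```python
# Weight counts for a 3x3 convolution mapping c_in channels to c_out channels.
k, c_in, c_out = 3, 32, 64

# Regular convolution: every filter spans all input channels
regular = k * k * c_in * c_out            # 3*3*32*64 = 18432

# Depthwise separable: one k x k filter per input channel (depthwise),
# then a 1x1 convolution to mix the channels (pointwise)
depthwise = k * k * c_in                  # 3*3*32 = 288
pointwise = c_in * c_out                  # 32*64  = 2048
separable = depthwise + pointwise         # 2336

print(regular, separable, round(regular / separable, 1))  # 18432 2336 7.9
```

For 3x3 kernels the reduction approaches a factor of 9 as the channel counts grow, which is where MobileNet's efficiency comes from.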
b, MobileNet_v2
Version 2 still uses depthwise separable convolutions, but its main building block has changed, as illustrated in Figure 2.10.
Figure 2.7: No FC layers at the end (only FC 1000 to output classes) [20]
The pointwise convolution in MobileNet version one keeps the number of channels the same or doubles them, whereas MobileNet version two reduces the number of channels. This process is known as a projection layer: it projects data with a high number of dimensions into a lower number of dimensions.
The second new feature of MobileNet_v2 is the residual block. It is motivated by ResNet [17] and helps with the flow of gradients through the network. The architecture of MobileNet_v2 is shown in Table 2.2.
c, MobileNetV3
MobileNetV3 is a convolutional neural network that is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm, and then subsequently improved through novel architecture advances (Figure 2.11). Advances include (1) complementary search techniques, (2) new efficient versions of nonlinearities practical for the mobile setting, and (3) a new efficient network design.
Figure 2.8: Depthwise separable convolutions visualization [21]
Figure 2.9: Pointwise convolutions visualization [21]
The paper Searching for MobileNetV3 presents this next generation of MobileNets, based on a combination of complementary search techniques as well as a novel architecture design. The paper explores how automated search algorithms and network design can work together to harness complementary approaches, improving the overall state of the art. Through this process, two new MobileNet models were released: MobileNetV3-Large and MobileNetV3-Small, which are targeted at high- and low-resource use cases, respectively.
Figure 2.10: Bottleneck residual block of MobileNet v2 [18]
Table 2.2: MobileNet v2 architecture and configuration [22]
Input Operator t c n s
224 x 224 x 3 Conv2d - 32 1 2
112 x 112 x 32 Bottleneck 1 16 1 1
112 x 112 x 16 Bottleneck 6 24 2 2
56 x 56 x 24 Bottleneck 6 32 3 2
28 x 28 x 32 Bottleneck 6 64 4 2
14 x 14 x 64 Bottleneck 6 96 3 1
14 x 14 x 96 Bottleneck 6 160 3 2
7 x 7 x 160 Bottleneck 6 320 1 1
7 x 7 x 320 Conv2d 1x1 - 1280 1 1
7 x 7 x 1280 Avgpool 7x7 - - 1 1
1 x 1 x 1280 Conv2d 1x1 - k - -
Figure 2.11: Building on MobileNetV3, the proposed segmentation head, Lite R-ASPP,
delivers fast semantic segmentation results while mixing features from multiple resolutions
[23]
2.2.5 Loss Function
a, Cross Entropy Loss
Logistic Regression is a fundamental machine learning algorithm used for binary and multi-class classification tasks. It is a simple yet powerful algorithm that works by modeling the relationship between a binary dependent variable (target) and one or more independent variables (features). Key components:
Binary Outcome: In binary classification, Logistic Regression predicts one of two possible outcomes, usually labeled 0 and 1 (e.g., spam or not spam, yes or no). In multi-class classification, it extends to predicting multiple classes, often using the “one-vs-all” or “softmax” approach
Sigmoid Function: Logistic Regression uses the sigmoid (logistic) function to
transform a linear combination of input features into a value between 0 and 1
Model Parameters: Logistic Regression estimates the parameters (coefficients)
that define the linear relationship between input features and the log-odds of
the target variable. These parameters are learned during training
Training and Prediction: The training process of Logistic Regression involves find-
ing the best parameters that minimize the logistic loss (log loss) or cross-entropy
loss on the training data. This is typically achieved using optimization algorithms
like gradient descent. Once trained, the model can be used for predictions. Given a
new set of input features, the model computes the probability that the input belongs
to the positive class (for binary classification) or to each class (for multi-class clas-
sification). A threshold is then applied to make the final classification decision.
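The sigmoid transform and threshold decision described above look like this in plain Python; the weights here are arbitrary illustrations, not learned parameters:

```python
import math

def sigmoid(z):
    # Squashes a linear combination of features into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    # Linear combination of inputs, then the sigmoid, then the threshold
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)                  # probability of the positive class
    return p, int(p >= threshold)   # final binary decision

prob, label = predict([1.0, 2.0], weights=[0.8, -0.3], bias=0.1)
print(label)  # z = 0.8 - 0.6 + 0.1 = 0.3, p ≈ 0.574, so class 1
```

For the multi-class case used in this thesis, the sigmoid is replaced by a softmax over 15 scores, and the threshold by an argmax.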
In summary, Logistic Regression is a versatile and widely used classification algorithm with applications in various domains. Its simplicity and interpretability make it an excellent choice for many classification tasks. Because this thesis addresses a multiclass classification problem, cross-entropy loss is chosen. Cross-entropy loss is commonly chosen for multi-class classification tasks for several important reasons:
Mathematical Appropriateness: Cross-entropy loss is well-suited for multi-
class classification because it measures the dissimilarity between predicted
class probabilities and the true class labels. It quantifies how well the predicted
probability distribution aligns with the actual distribution of class labels
Probability Interpretability: Cross-entropy loss encourages the model to pro-
duce predicted probabilities that are close to 1 for the true class and close to
0 for other classes. This aligns with the natural interpretation of probability
distributions, making the results more interpretable
Gradient Descent Optimization: The gradient of the cross-entropy loss with
respect to model parameters is straightforward to compute, making it suitable
for optimization algorithms like gradient descent. Efficient optimization is
crucial for training deep neural networks, which are commonly used in multi-
class classification tasks
Overcoming Class Imbalance: Cross-entropy loss handles class imbalances
well. In scenarios where some classes have significantly more or fewer sam-
ples than others, cross-entropy loss can still provide effective gradients for
updating model parameters
Scaling to Multiple Classes: Cross-entropy loss naturally extends to handle
more than two classes without significant modifications. It can handle scenar-
ios where you have three or more mutually exclusive classes
Logarithmic Scale: Cross-entropy loss operates on a logarithmic scale, which
means it heavily penalizes large prediction errors. This is important in clas-
sification tasks where small prediction errors are acceptable, but large errors
should be minimized
State-of-the-Art Performance: In practice, cross-entropy loss has been found
to work well in a wide range of multi-class classification tasks. It is often the
default choice because of its effectiveness and ease of use
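For a single sample, categorical cross-entropy reduces to the negative log of the probability assigned to the true class. A short NumPy sketch with illustrative probability vectors shows the logarithmic penalty mentioned above:

```python
import numpy as np

def cross_entropy(probs, true_class, eps=1e-12):
    # -log p_true: near 0 when the true class receives probability close to 1,
    # and very large (logarithmic penalty) when it receives probability near 0
    return -np.log(probs[true_class] + eps)

confident = np.array([0.9, 0.05, 0.05])   # most mass on true class 0
uncertain = np.array([0.2, 0.4, 0.4])     # little mass on true class 0

print(round(cross_entropy(confident, 0), 3))  # 0.105
print(round(cross_entropy(uncertain, 0), 3))  # 1.609
```

The confident prediction is penalized an order of magnitude less than the uncertain one, which is exactly the gradient signal that drives the softmax output toward the true class.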
2.2.6 Supported Library
Table 2.3: Supported Library
No Supported Library URL
1 Numpy https://numpy.org/
2 Pandas https://pandas.pydata.org/
3 Tensorflow https://www.tensorflow.org/
4 Keras https://keras.io/
5 Sklearn https://scikit-learn.org/stable/
6 Matplotlib https://matplotlib.org/
7 Seaborn https://seaborn.pydata.org/
a. Numpy
NumPy [24] is a fundamental open-source library in the Python programming
ecosystem that plays a pivotal role in numerical and scientific computing. It pro-
vides support for working with large, multi-dimensional arrays and matrices, as
well as a collection of high-level mathematical functions to operate on these ar-
rays. NumPy is the foundation for numerous other libraries and tools used in data
analysis, machine learning, and scientific research. It not only enhances the com-
putational capabilities of Python but also offers the convenience of an extensive
and efficient data manipulation and analysis toolkit. NumPy’s ability to efficiently
handle numerical data and perform vectorized operations makes it an indispens-
able tool for professionals and researchers across a wide range of fields, from data
science and machine learning to physics, engineering, and finance.
b. Pandas
Pandas [25] is a powerful open-source data analysis and manipulation library in
Python. It offers high-performance, easy-to-use data structures and data analysis
tools designed to simplify working with structured data. At the core of Pandas are
two primary data structures: DataFrames, which resemble tables or spreadsheets,
and Series, which are one-dimensional arrays. These structures are versatile and
allow users to load, clean, reshape, and analyze data effectively. Pandas excels at
handling real-world data, dealing with missing values, and providing powerful data
aggregation and manipulation capabilities. With its wide range of functions for
data indexing, selection, and transformation, Pandas is an indispensable tool for
data scientists, analysts, and researchers in numerous domains, including finance,
economics, social sciences, and more. Its seamless integration with other Python libraries, like NumPy and Matplotlib, makes Pandas a cornerstone for data analysis
and manipulation in the Python ecosystem.
c. Tensorflow
TensorFlow [26] is an open-source machine learning framework that has gained
widespread popularity for its flexibility and robustness. Developed by Google Brain,
TensorFlow provides a comprehensive ecosystem of tools, libraries, and commu-
nity resources that facilitate the development of machine learning and deep learn-
ing models. It allows researchers and developers to build, train, and deploy machine
learning models for a wide range of applications, from image and speech recogni-
tion to natural language processing and reinforcement learning. One of Tensor-
Flow’s key strengths is its support for both high-level APIs, like Keras for quick
model prototyping, and low-level APIs, offering fine-grained control over model
architecture and training. TensorFlow’s versatility makes it suitable for both begin-
ners and experts in the field of machine learning.
d. Keras
Keras [27] is an open-source high-level neural networks API written in Python.
It is designed to be user-friendly, modular, and easy to use, making it a popular
choice for both beginners and experts in the field of deep learning and artificial
intelligence. Keras acts as an interface for the TensorFlow library, as well as other
popular deep learning frameworks, allowing users to quickly and efficiently build
and train neural networks for various tasks, such as image classification, natural
language processing, and more. One of the key advantages of Keras is its simplic-
ity and abstraction. It provides a straightforward and intuitive way to define and
configure neural network models, with a focus on enabling rapid experimentation
and prototyping. With Keras, users can construct complex network architectures
by simply stacking together layers and specifying their configurations. Keras sup-
ports both CPU and GPU acceleration, making it suitable for various computing
environments. Additionally, it offers a wide range of pre-trained models and tools
for transfer learning, allowing users to leverage the knowledge gained from exist-
ing models to solve new and specific problems. Overall, Keras is a versatile and
accessible library for building and training deep learning models.
e. Sklearn
Scikit-Learn [28], commonly referred to as sklearn, is a renowned open-source
machine learning library for Python. It provides a wide array of efficient and user-friendly tools for various aspects of machine learning, including classification, re-
gression, clustering, dimensionality reduction, and model selection and evaluation.
Scikit-Learn is characterized by its clean, consistent API, which makes it easy to
use, especially for those new to machine learning. It offers an extensive selection of
algorithms for different tasks and supports the entire machine learning workflow,
from data preprocessing to model training and evaluation. With its comprehensive
documentation and a large community of users and developers, Scikit-Learn has
established itself as an indispensable library for machine learning and data science
applications, empowering users to tackle complex problems and make informed
decisions based on data-driven insights.
f. Matplotlib
Matplotlib [29] is a widely used open-source data visualization library for the
Python programming language. It provides a versatile and comprehensive platform
for creating high-quality charts, graphs, and visual representations of data. With
Matplotlib, users can generate a wide range of plot types, including line plots,
scatter plots, bar plots, histograms, and more, making it an essential tool for data
scientists, researchers, and analysts. Matplotlib’s customizable nature allows users
to fine-tune the appearance and aesthetics of their visualizations, from colors and
labels to titles and annotations. Whether for exploratory data analysis, scientific re-
search, or data presentation, Matplotlib’s flexibility and extensive documentation
have made it a staple in the Python data visualization landscape. Additionally, it
serves as the foundation for various higher-level data visualization libraries, fur-
ther expanding its utility in the field of data science and beyond.
g. Seaborn
Seaborn [30] is a powerful Python data visualization library built on top of Mat-
plotlib. It is widely used for creating informative and attractive statistical graphics.
Seaborn provides a high-level interface for drawing visually appealing and infor-
mative statistical graphics. This library comes with several built-in themes and
color palettes to improve the aesthetics of plots and charts. Seaborn simplifies the
process of creating complex visualizations, such as heatmaps, pair plots, and bar
plots, making it a valuable tool for data exploration and analysis. It seamlessly in-
tegrates with data structures from both Pandas and NumPy, allowing for easy data
manipulation and visualization. Seaborn is an indispensable tool for enhancing the
visual appeal and interpretability of plots.
CHAPTER 3. PROPOSED METHOD
3.1 Overview
Figure 3.1: System overview
In the overview of the proposed method for Hanoi landmark recognition, we follow a systematic process to achieve our goals. This process involves multiple stages,
each contributing to the overall success of the landmark recognition system. The
initial step is “Data Preprocessing“. In this phase, the dataset comprising images,
textual information, and potentially geographical coordinates of Hanoi landmarks
is carefully cleaned, organized, and prepared for analysis. This includes tasks such
as data cleaning to remove noise, normalizing image resolutions, and structuring
textual data for consistency. Additionally, data augmentation techniques may be
employed to enhance the dataset's diversity, ensuring that the model is robust to
variations in lighting, perspective, and environmental conditions.
Following data preprocessing is “Feature Extraction“. In the context of deep learn-
ing, this step typically involves using convolutional neural networks (CNNs) to
automatically learn and extract relevant features from images. These features may
encompass architectural details, distinctive visual characteristics, and unique pat-
terns that define each landmark. The goal of feature extraction is to reduce the di-
mensionality of the data while retaining the most important information, allowing
the model to effectively distinguish between different landmarks based on visual
cues.
“Landmark Recognition“ is the phase where the deep learning model is trained
to identify and classify Hanoi landmarks based on the extracted features. A fully
connected layer (Dense) with 128 units and ReLU activation is added for feature
transformation. Finally, an output layer with 15 units (corresponding to 15 classes)
and softmax activation is added for classification.
3.2 Data Preprocessing
In the context of Hanoi landmark recognition, the Data Preprocessing stage
plays a pivotal role in ensuring the quality and consistency of the dataset. This
Figure 3.2: Data Preprocessing
essential step involves a series of operations and transformations applied to the
collected images before they are used for feature extraction and landmark recogni-
tion. Regarding the detailed information in the dataset, it is elaborated extensively
in Chapter 4.
Data Cleaning: This step includes the removal of any irrelevant or duplicate
images, eliminating corrupted files, and addressing any inconsistencies or ar-
tifacts present in the images. Data cleaning helps ensure the dataset is free
from noise and outliers.
Resizing and Standardization: To ensure a consistent input to the model, all
images are resized to a uniform resolution of 224x224 pixels. This standard-
ization enables the neural network to process images of the same size, simpli-
fying the subsequent feature extraction process.
Labeling: After capturing images at various landmarks, the process involves
classifying the images into separate folders and labeling them accordingly.
Each image is assigned the correct class for the landmark it represents, an
essential step for training a supervised learning model that can accurately rec-
ognize these landmarks later.
Data Splitting: After labeling, the images are separated into training, validation, and test sets at a ratio of 7:2:1.
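The resizing and splitting steps above can be sketched as follows; the helper names and the shuffling policy are illustrative assumptions, not the thesis's actual code:

```python
import random
import numpy as np
from PIL import Image

def load_image(path):
    # Force RGB and the uniform 224x224 resolution used throughout the thesis
    img = Image.open(path).convert("RGB").resize((224, 224))
    return np.asarray(img, dtype=np.float32) / 255.0  # scale to [0, 1]

def split_7_2_1(samples, seed=42):
    # Shuffle once, then cut at 70% and 90% for a 7:2:1 train/val/test split
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    a, b = int(0.7 * n), int(0.9 * n)
    return items[:a], items[a:b], items[b:]

train, val, test = split_7_2_1(range(100))
print(len(train), len(val), len(test))  # 70 20 10
```

Fixing the shuffle seed keeps the three subsets disjoint and reproducible across experiments.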
Data preprocessing for Hanoi landmark recognition is a critical foundation for the
subsequent stages of feature extraction and landmark recognition. It guarantees that
the dataset is well-structured, standardized, and ready for the application of deep
learning techniques in the pursuit of accurate landmark identification.
3.3 Feature Extractions
In this context, the feature extraction input is image data with a shape of (224, 224, 3), and the output is the set of features extracted by the CNN model. These features are processed through layers like Flatten, Dense (128), and Dense (15) with softmax activation, resulting in a probability distribution over 15 classes.
Figure 3.3: Feature Extractions
Figure 3.4: Feature Extractions Details
Feature engineering using Convolutional Neural Networks (CNNs) is a process of learning and extracting meaningful features from images using deep learning techniques. CNNs are particularly well-suited for feature engineering in computer vision tasks due to their ability to automatically learn hierarchical features from raw image data.
In this thesis, various pre-trained CNN models have been experimented with, such as ResNet-50 [17], MobileNet_v2 [18], VGG16 [3] and DenseNet [5]. MobileNet_v2
possesses state-of-the-art design, an efficient and lightweight architecture, which
results in quicker training and deployment. Despite being relatively lightweight,
MobileNet_v2 maintains a high level of accuracy, rendering it well-suited for a
broad spectrum of image classification tasks. It is possible to initialize MobileNet_v2
with pre-trained weights on extensive datasets like ImageNet. And that is the rea-
son MobileNet_v2 is chosen for the Hanoi landmark recognition task.
Besides, the choice of the DenseNet model for image classification can be attributed to several compelling reasons. First and foremost, DenseNet has demonstrated outstanding performance in various computer vision tasks and is renowned
for its ability to capture intricate features and patterns in images. Its densely con-
nected architecture fosters feature reuse and encourages gradient flow through-
out the network, which leads to improved training efficiency and faster conver-
gence. Moreover, DenseNet exhibits impressive accuracy in image classification
tasks while being relatively lightweight compared to some other complex models,
which is especially valuable when deploying models in resource-constrained envi-
ronments. Additionally, DenseNet's hierarchical feature extraction helps in capturing both low-level and high-level features, making it a suitable choice for a
wide range of image classification tasks. The combination of these factors makes
DenseNet an excellent candidate for achieving high accuracy in image classifica-
tion while maintaining computational efficiency.
ResNet-50 is a deep convolutional neural network architecture that has consistently
demonstrated outstanding performance in various computer vision tasks, including
image classification. Its deep structure allows it to capture intricate features and
patterns in images, enabling it to excel in tasks that require high levels of accu-
racy. Additionally, ResNet-50 is pre-trained on a large-scale dataset, namely Ima-
geNet, which provides it with a strong foundation in image understanding. This pre-
training imparts valuable knowledge about various objects, shapes, and textures,
making it easier to fine-tune the model for specific image classification tasks.
In the end, this thesis chose VGG16 for the following reasons: VGG16 is pre-
trained on a large dataset like ImageNet, which means it has already learned to
recognize a broad range of features from diverse images. Leveraging transfer learn-
ing from VGG16 can significantly speed up training and improve performance on
smaller datasets. VGG16’s architecture is relatively simple and intuitive, making
it easy to implement and adapt for specific classification tasks. It has been a key
player in many winning solutions for image recognition competitions. The archi-
tectures of these networks are elaborated on in detail in Section 2.2 - background.
After choosing from these pre-trained model architectures, the pre-trained model is loaded using the VGG16/DenseNet/ResNet-50/MobileNet_v2 function from Keras. The include_top parameter is set to False, which means the fully connected layers (classification layers) at the top of the CNN model are not included. This allows the model to be used for feature extraction. After that, the freezing process ensures that the weights of the pre-trained CNN model layers are not updated during training, so they act as feature extractors only.
3.4 Landmark Recognition
On top of the pre-trained layers, custom classification layers are added. A Flat-
ten layer is used to convert the feature maps from the model into a 1D vector. Then,
Figure 3.5: Landmark Recognition
Figure 3.6: Landmark Recognition Details
a fully connected layer (Dense) with 128 units and ReLU activation is added for
feature transformation. Finally, an output layer with 15 units (corresponding to 15
classes) and softmax activation is added for classification.
The model is compiled with the specified optimizer, loss function, and evaluation
metrics. In this case, the Adam optimizer is used, cross-entropy loss is chosen as
the loss function, and the model’s accuracy and top-5 accuracy are used as metrics.
The learning rate (0.001) used with the Adam optimizer is appropriate for fine-
tuning a pre-trained model. It’s a common practice to start with a relatively small
learning rate for fine-tuning to avoid overfitting. Adam’s adaptive learning rate can
help navigate the complex parameter space efficiently. Adam is known for working
well with cross-entropy loss, making it a suitable choice for this kind of problem
[31].
In this thesis, landmark recognition is regarded as a multi-class classification task, and the selection of cross-entropy loss is based on several key considerations. Cross-entropy loss is widely preferred for multi-class classification problems due to its significant advantages and relevance in such contexts.
Mathematical Appropriateness: Cross-entropy loss is well-suited for multi-
class classification because it measures the dissimilarity between predicted
class probabilities and the true class labels. It quantifies how well the predicted
probability distribution aligns with the actual distribution of class labels.
Probability Interpretability: Cross-entropy loss encourages the model to pro-
duce predicted probabilities that are close to 1 for the true class and close to
0 for other classes. This aligns with the natural interpretation of probability
distributions, making the results more interpretable.
Gradient Descent Optimization: The gradient of the cross-entropy loss with
respect to model parameters is straightforward to compute, making it suitable
for optimization algorithms like gradient descent. Efficient optimization is
crucial for training deep neural networks, which are commonly used in multi-
class classification tasks.
Overcoming Class Imbalance: Cross-entropy loss handles class imbalances
well. In scenarios where some classes have significantly more or fewer sam-
ples than others, cross-entropy loss can still provide effective gradients for
updating model parameters.
Scaling to Multiple Classes: Cross-entropy loss naturally extends to handle
more than two classes without significant modifications. It can handle scenar-
ios where you have three or more mutually exclusive classes.
Logarithmic Scale: Cross-entropy loss operates on a logarithmic scale, which
means it heavily penalizes large prediction errors. This is important in clas-
sification tasks where small prediction errors are acceptable, but large errors
should be minimized.
State-of-the-Art Performance: In practice, cross-entropy loss has been found
to work well in a wide range of multi-class classification tasks. It is often the
default choice because of its effectiveness and ease of use.
CHAPTER 4. DATASET
4.1 Overview
The dataset includes 3108 images across 15 classes (Table 4.1), split into train, validation, and test sets at a ratio of 7:2:1, respectively.
Figure 4.1: Data Overview
4.2 Data Collection
In the data collection phase, two tasks were performed: gathering images and
subsequently labeling them.
Data Gathering: The process began by meticulously selecting these landmarks,
which included both iconic and historically significant sites, such as the Ho
Chi Minh Mausoleum, the Temple of Literature, and the Flag Tower of Hanoi.
Once the landmarks were chosen, the fieldwork commenced. Our team visited
each location during different times of the day, taking photographs under var-
ious lighting conditions, including early morning (Figure 4.2), sunset (Figure
4.3) and evening (Figure 4.4). This diverse collection of images aimed to en-
sure a comprehensive dataset for subsequent landmark recognition research.
The images were captured using high-resolution cameras, and attention was
given to framing and composition to provide clear and informative visual data.
This extensive data collection process was a fundamental step in building my
Table 4.1: Overview 15 classes dataset
No English name
1 Flag Tower of Hanoi
2 Hang Dau Flower Garden
3 Hang Dau Water Tank
4 Hanoi Opera House
5 Ho Chi Minh Mausoleum
6 Hoa Lo Prison
7 Lenin Park
8 Ly Thai To Statue
9 One Pillar Pagoda
10 Quan Thanh Temple
11 St. Joseph Cathedral (The Big Church)
12 Temple of Literature
13 The Huc Bridge
14 Tran Quoc Pagoda
15 Turtle Tower
landmark recognition dataset.
Data Labeling: Each image is associated with a label or class identifier. This
step involves annotating the images with the correct category they belong to.
Because proper labeling is crucial for training and evaluating a classification
model, the collected data was divided into 15 different directories, each la-
beled accordingly.
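A directory-per-class layout like this can be turned into (path, label) pairs with a short helper. This is an illustrative sketch only; the function name and the assumption that images are stored as .jpg files are mine, not from the thesis.

```python
from pathlib import Path

def index_labeled_images(root):
    """Map each image file to the class label given by its directory name.

    Expects a layout like root/<class_name>/<image>.jpg, as used for the
    15 landmark directories described above (layout and names assumed).
    """
    root = Path(root)
    classes = sorted(d.name for d in root.iterdir() if d.is_dir())
    class_to_id = {name: i for i, name in enumerate(classes)}
    samples = [(str(p), class_to_id[c])
               for c in classes
               for p in sorted((root / c).glob("*.jpg"))]
    return samples, class_to_id
```
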
4.3 Data Statistics
In this section, the dataset will be presented through statistical tables (Table 4.2,
Table 4.3) and figures (Figure 4.6) to offer a clearer understanding of the created
dataset. The total dataset comprises 3108 images: 2168 images in the train set,
Table 4.2: Statistical information about the dataset
No Summary Detail
1 Total Number of Samples 3108
2 Number of Train Samples 2168
3 Number of Validation Samples 617
4 Number of Test Samples 323
5 Number of Classes 15
6 Image Size Distribution 224 x 224
617 images in the validation set, and 323 images in the test set.
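A 7:2:1 split can be sketched as below. Note that the exact counts in Table 4.2 (2168/617/323) suggest the thesis applied the split per class, so a naive global split like this assumed helper yields slightly different totals.

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle and split a list into train/validation/test parts by ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```
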
Figure 4.2: One Pillar Pagoda
Figure 4.3: Hanoi Opera House
Figure 4.4: The Huc Bridge
Table 4.3: Number of samples per each class
No Class Number of samples
1 Flag Tower of Hanoi 361
2 Hang Dau Flower Garden 116
3 Hang Dau Water Tank 147
4 Hanoi Opera House 315
5 Ho Chi Minh Mausoleum 181
6 Hoa Lo Prison 295
7 Lenin Park 144
8 Ly Thai To Statue 130
9 One Pillar Pagoda 359
10 Quan Thanh Temple 100
11 St. Joseph Cathedral (The Big Church) 112
12 Temple of Literature 340
13 The Huc Bridge 174
14 Tran Quoc Pagoda 88
15 Turtle Tower 246
Figure 4.5: Images of 15 attractions from the dataset
Figure 4.6: Distribution of image classes
CHAPTER 5. EXPERIMENTS
5.1 Environment
Google Colab Pro: Python 3 with a T4 GPU (the T4 GPU within the Google Colab
environment was used to perform computationally demanding tasks).
5.2 Experiment
In the context of image classification benchmarking, several evaluation met-
rics are employed to thoroughly assess model performance. These metrics include
Accuracy, Top-1 Accuracy, Top-5 Accuracy, and the Loss function. Accuracy mea-
sures the overall correctness of class predictions, while Top-1 Accuracy focuses
on the percentage of images for which the correct class is predicted as the top
choice. Additionally, Top-5 Accuracy evaluates how often the correct class falls
within the top five predicted classes. These multiple metrics provide a compre-
hensive view of the model's classification abilities, enabling a more nuanced
evaluation of its effectiveness in recognizing and categorizing images.
The loss function is a metric that quantifies the error or dissimilarity between
the predicted values and the actual values. It is used to assess how well the
model's predictions align with the true values; the goal is typically to minimize
it. In this thesis, cross-entropy loss is used. Lower loss values indicate a
better fit between predictions and true values.
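The Top-1 and Top-5 metrics just described can be computed directly from the predicted probability matrix; a small NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def top_k_accuracy(y_true, probs, k=1):
    """Fraction of samples whose true class index is among the k
    highest-probability classes in each row of `probs`."""
    top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k largest per row
    hits = [y in row for y, row in zip(y_true, top_k)]
    return float(np.mean(hits))

# Top-1 equals ordinary accuracy; Top-5 also counts near-miss predictions.
```
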
Table 5.1 lists the hyperparameter values used in the thesis. Dense Layer 1 uses
Table 5.1: Hyperparameters
Layer Hyperparameter Value
Dense Layer 1 Units 128
Dense Layer 1 Activation ReLU
Dense Layer 2 Units 15
Dense Layer 2 Activation Softmax
Dense Layer 2 Optimizer Adam (0.001)
Dense Layer 2 Loss Function Cross-Entropy Loss
128 units with the ReLU activation function. Dense Layer 2 uses 15 units with the
softmax activation function. The choice of 128 units in Dense Layer 1 is a value
selected based on experimentation and model tuning to achieve a balance between
model complexity and training capability without encountering overfitting issues.
This decision also takes into account the actual number of samples collected in the
dataset for the problem. The ReLU activation function is used to minimize com-
putational load and enhance the model’s ability to learn non-linear features in the
neural network.
In Dense Layer 2, the number of neurons is set to 15. This final dense layer is
designed with 15 units because the desired number of output classes is 15 (cor-
responding to the goal of recognizing 15 classes of landmarks in Hanoi). The
activation function used in Dense Layer 2 is softmax. Softmax is often used in
the output layer of a neural network for multi-class classification, as it con-
verts the network's raw output into probability scores for each class. The op-
timization algorithm chosen is Adam with a learning rate of 0.001. Adam is a
popular optimization algorithm that adapts the learning rate of each parameter
individually. Cross-entropy loss is the chosen loss function, as it is suitable
for classification problems, measuring the performance of a model whose output
is a probability value between 0 and 1.
This thesis conducted experiments with four different models. Through the ex-
perimental process, DenseNet and MobileNet_v2 achieved the highest results.
Below are tables and figures for the models with the best results.
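To make the configuration in Table 5.1 concrete, the classification head can be sketched as a forward pass in NumPy. The 1280-dimensional input width is an assumption matching MobileNet_v2's pooled feature size; the actual experiments were run with Keras layers, not this code.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=1, keepdims=True)

# Classification head from Table 5.1: Dense(128, ReLU) -> Dense(15, softmax).
# FEATURES = 1280 is an assumed input width (MobileNet_v2's pooled features).
FEATURES, HIDDEN, CLASSES = 1280, 128, 15
W1 = rng.normal(0.0, 0.05, (FEATURES, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.05, (HIDDEN, CLASSES)); b2 = np.zeros(CLASSES)

def head_forward(features):
    """Forward pass of the two dense layers: hidden ReLU, then softmax over 15 classes."""
    h = relu(features @ W1 + b1)
    return softmax(h @ W2 + b2)
```
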
Table 5.2: The best accuracy and execution times of the models
Models Max Accuracy Epoch Execution time (s) Top1_acc Top5_acc
ResNet-50 85.59 134 971 85.59 98.01
VGG16 97.51 61 1124 97.51 100
DenseNet 99.66 30 649 99.66 100
MobileNet_v2 98.84 18 628 98.84 100
Table 5.2 and Figure 5.1 provide a comparison of the best performance met-
rics for the four models. DenseNet achieved the highest accuracy of 99.66%,
followed by MobileNet_v2 (98.84%), VGG16 (97.51%), and ResNet-50 (85.59%).
DenseNet and MobileNet_v2 outperform the other models in terms of accuracy.
The lower accuracy of ResNet-50 is attributed to the fact that deep models such
as ResNet typically require longer training to reach their maximum potential.
MobileNet_v2 required the fewest epochs (18) to reach its high accuracy. DenseNet
also converged relatively quickly (30 epochs), while VGG16 (61 epochs) and ResNet-
50 (134 epochs) took more time.
MobileNet_v2 and DenseNet had significantly shorter training times (628 and 649
seconds, respectively). ResNet-50 took the longest time to train, with an execution
Figure 5.1: Comparison between top-1 accuracy and model size
time of 971 seconds. VGG16 had a relatively long execution time of 1124 seconds.
Three models achieved a Top-5 accuracy of 100%, while ResNet-50 reached 98.01%.
Based on these criteria, DenseNet stands out as the best-performing model,
achieving the highest accuracy with few training epochs. However, considering
both computational efficiency and accuracy, MobileNet_v2 achieves high accuracy
with a shorter execution time. ResNet-50 and VGG16, although reaching notable
accuracy, require longer training time.
To examine the results in more detail: ResNet-50, being a deep neural network,
often requires a large amount of data to generalize well. With only 3108 samples,
the dataset may be insufficient for training a complex model like ResNet-50,
which explains why it has the lowest accuracy among the four models tested.
Table 5.3: The weight of the models (MB)
Models The weight of the models (MB)
ResNet-50 89.98
DenseNet 69.89
VGG16 56.13
MobileNet_v2 8.61
Table 5.3 and Figure 5.2 provide insights into the varying model sizes of ResNet-
50, DenseNet, VGG16, and MobileNet_v2. ResNet-50 has the largest model size,
consuming 89.98 Megabytes. DenseNet follows with a slightly smaller model size
Figure 5.2: Comparison between execution time and model size
of 69.89 Megabytes. VGG16 has a comparatively smaller model size, using 56.13
Megabytes. On the other hand, MobileNet_v2 stands out as the most lightweight
model, with a significantly smaller size of only 8.61 Megabytes. These findings
highlight the trade-off between model size and computational efficiency, which is
essential when deploying models on resource-constrained devices or in scenarios
with limited storage capacity. All experiments were evaluated on the validation
set of 617 images.
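The sizes in Table 5.3 are, to a first approximation, the parameter count times the bytes per weight. A back-of-the-envelope helper is sketched below; the roughly 3.5 million parameter figure for MobileNet_v2 is an approximation from the literature, and reported sizes also depend on the serialization format and which layers are saved.

```python
def model_size_mb(num_params, bytes_per_param=4):
    """Approximate on-disk size of a model whose weights are stored
    as 32-bit floats (4 bytes per parameter)."""
    return num_params * bytes_per_param / (1024 ** 2)

# For example, a model with ~3.5 million parameters needs roughly 13 MB
# at float32, before any compression or quantization.
```
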
In conclusion, when examining accuracy, the loss function, execution time, and
model weight in Table 5.2, Table 5.3, and Figure 5.6, it is evident that Mobile-
Net_v2 achieved the best overall performance: it excels in accuracy while having
the shortest execution time and the lightest model weight. The practical results
presented in this thesis demonstrate that MobileNet_v2 is the most suitable
choice for the Hanoi landmark recognition task.
Figure 5.3: Accuracy and Loss function of ResNet-50 [17] [4]
Figure 5.4: Accuracy and Loss function of VGG16 [3]
Figure 5.5: Accuracy and Loss function of DenseNet [5]
Figure 5.6: Accuracy and Loss function of MobileNet_v2 [18]
CHAPTER 6. CONCLUSION AND FUTURE WORK
6.1 Conclusion
Following guidance from Dr. Dang Tuan Linh, a program has been successfully
developed in this thesis, meeting the predefined objectives effectively:
Build a dataset of 15 landmarks in Hanoi with 3108 samples.
Find a suitable solution for the landmark dataset. The classifier works well
on test images, reaching an accuracy greater than 95%.
This thesis has researched a method for identifying Hanoi landmark attractions.
Although the initial objectives have been achieved, there is still a need for signif-
icant further refinement in the future. The dataset is small due to the limited time
for collection.
After completing the project, surveys were conducted to collect feedback from
friends and colleagues. Along with the positive feedback, numerous valuable sug-
gestions were received for further enhancement of the topic, such as expanding
the existing dataset and exploring various methods to optimize the model.
6.2 Future work
Despite our efforts, there are still many areas for improvement in the thesis.
The experimental results of the thesis have demonstrated that the MobileNet_v2
model achieves high performance in terms of accuracy, execution time, and model
weight. This paves the way for future directions, such as developing mobile ap-
plications that allow tourists to capture and upload images for image recognition.
Additionally, these applications can incorporate recommendation features, provid-
ing relevant information about the identified landmarks. The new dataset primarily
focuses on landmarks in Hanoi. In the future, there may be opportunities to ex-
pand this dataset to include famous landmarks throughout Vietnam. Enriching the
dataset further would open up broader possibilities for future research endeavors.
Another direction is exploring new approaches to optimize the model.
Based on the researched landmark recognition model, I propose a solution for
turning it into an application for landmark recognition in Hanoi as follows:
User: In the user interface, the user uploads an image or takes a photo on the phone
Application: Send the image to the server
Server: Receive and recognize the image by calling the "Hanoi Landmark
Recognition" function
Figure 6.1: Proposed Solution
Server: Send the result back to the application
Application: Receive the result and display it in the interface
User: Press a button to see the landmark information
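The client-server flow above can be sketched as a single server-side handler. Everything here (the JSON shape, the function names, and the stubbed recognizer) is illustrative, since a real deployment would load the trained MobileNet_v2 model on the server.

```python
import base64
import json

def recognize_landmark(image_bytes):
    """Stub for the "Hanoi Landmark Recognition" function: a deployed
    server would run the trained classifier on the decoded image here."""
    return "Turtle Tower", 0.99

def handle_recognition_request(request_body: bytes) -> bytes:
    """Decode an uploaded (base64-encoded) image and return the
    predicted landmark and its confidence as a JSON response body."""
    payload = json.loads(request_body)
    image_bytes = base64.b64decode(payload["image"])
    label, confidence = recognize_landmark(image_bytes)
    return json.dumps({"landmark": label, "confidence": confidence}).encode()
```
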
REFERENCE
[1] General Statistics Office of Vietnam, Service revenue rises sharply during the
2023 peak tourism season. [Online]. Available: https://www.gso.gov.vn/du-lieu-va-so-lieu-thong-ke/2023/08/doanh-thu-dich-vu-tang-manh-trong-mua-cao-diem-du-lich-he-2023/
(visited on 10/13/2023).
[2] Web portal of the Ministry of Culture, Sports and Tourism, More than 1 million
international visitors to Vietnam in September 2023. [Online]. Available:
https://bvhttdl.gov.vn/hon-1-trieu-luot-khach-quoc-te-den-viet-nam-trong-thang-9-nam-2023-20230929145213177.htm
(visited on 10/13/2023).
[3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-
scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[4] S. Mascarenhas and M. Agarwal, “A comparison between VGG16, VGG19 and
ResNet50 architecture frameworks for image classification,” in 2021 Interna-
tional Conference on Disruptive Technologies for Multi-Disciplinary Research
and Applications (CENTCON), 2021, pp. 96–99.
[5] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected
convolutional networks,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 4700–4708.
[6] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks
for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[7] Metrip. [Online]. Available: https://www.metrip.com (visited on 12/15/2022).
[8] Travel Planner. [Online]. Available: https://travel.sygic.com/en (visited on
12/15/2022).
[9] Google, Google Landmarks v2. [Online]. Available: https://research.google/resources/datasets/google-landmarks-v2/
(visited on 10/13/2023).
[10] Places365. [Online]. Available: http://places2.csail.mit.edu/download.html
(visited on 10/13/2023).
[11] InLoc dataset. [Online]. Available: https://github.com/HajimeTaira/InLoc_dataset
(visited on 10/13/2023).
[12] GeoLife GPS trajectory dataset. [Online]. Available: https://www.microsoft.com/en-us/research/publication/geolife-gps-trajectory-dataset-user-guide/
(visited on 10/13/2023).
[13] D. G. Lowe, “SIFT: The scale invariant feature transform,” Int. J, vol. 2,
no. 91–110, p. 2, 2004.
[14] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, “Deep image retrieval:
Learning global representations for image search,” in Computer Vision–ECCV
2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14,
2016, Proceedings, Part VI, 2016, pp. 241–257.
[15] K. Sun, Y. Zhu, P. Pan, et al., “Geospatial data ontology: The semantic foun-
dation of geospatial data integration and sharing,” Big Earth Data, vol. 3,
no. 3, pp. 269–296, 2019.
[16] Z. Huang and Y. Li, “Interpretable and accurate fine-grained recognition via
region grouping,” in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2020, pp. 8662–8672.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 770–778.
[18] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2:
Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with
deep convolutional neural networks,” in Advances in Neural Information Pro-
cessing Systems 25, 2012.
[20] Stanford University, CNN architectures. [Online]. Available: http://cs231n.stanford.edu/slides/2022/lecture_6_jiajun.pdf
(visited on 10/15/2023).
[21] MachineThink, Google’s MobileNets on the iPhone. [Online]. Available:
https://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/
(visited on 10/15/2023).
[22] MobileNet v2 implementation in PyTorch. [Online]. Available: https://pytorch.org/hub/pytorch_vision_mobilenet_v2/
(visited on 10/15/2023).
[23] A. Howard et al., “Searching for MobileNetV3,” in Proceedings of the
IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324.
[24] NumPy. [Online]. Available: https://numpy.org/ (visited on 10/15/2023).
[25] Pandas. [Online]. Available: https://pandas.pydata.org/ (visited on
10/15/2023).
[26] TensorFlow. [Online]. Available: https://www.tensorflow.org/ (visited on
10/15/2023).
[27] Keras. [Online]. Available: https://keras.io/ (visited on 10/15/2023).
[28] scikit-learn. [Online]. Available: https://scikit-learn.org/stable/ (visited
on 10/15/2023).
[29] Matplotlib. [Online]. Available: https://matplotlib.org/ (visited on
10/15/2023).
[30] Seaborn. [Online]. Available: https://seaborn.pydata.org/ (visited on
10/15/2023).
[31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.